KubernetesPodOperator: Fix hanging API calls #60254
Conversation
SameerMesiah97
left a comment
Overall a solid PR, and the wrapper class is a great approach. But there is one potential interaction I wanted to flag after reading through the code path: in idle or stalled watch scenarios (i.e. no events received for a period), the client-side API timeout effectively acts as a hard upper bound on how long the watch remains open whenever it is lower than the server-side watch timeout. Normally this would not be too concerning, but it could surprise users who rely on longer server-side timeouts.
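For illustration only (this is not code from the PR), a minimal sketch of how the two timeouts interact in the Kubernetes Python client; the namespace and numeric values are arbitrary:

```python
# Hypothetical illustration of the interaction flagged above.
# `timeout_seconds` is the server-side watch timeout; `_request_timeout` is the
# client-side (urllib3) timeout applied to the underlying HTTP request.
from kubernetes import client, config, watch

config.load_kube_config()
core_v1 = client.CoreV1Api()

w = watch.Watch()
for event in w.stream(
    core_v1.list_namespaced_pod,
    namespace="default",
    timeout_seconds=120,   # server closes the watch after 120s
    _request_timeout=60,   # client aborts the whole request after 60s,
                           # even if the server-side timeout has not elapsed
):
    print(event["type"], event["object"].metadata.name)
```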
jscheffl
left a comment
Some typing hint nits, else LGTM!
Also might be good to consider the comments from @SameerMesiah97 - mainly documenting this (maybe inline in code?). I know that watches are limited to 30s, which was the reason for the timeout, and I assume adding the 60s hard limit is still beneficial to make it fail-safe.
@jscheffl Thank you for your input. This is a good suggestion, and I've implemented the changes accordingly.
@SameerMesiah97 Thank you for the feedback. I've implemented a dynamic approach where the client-side timeout now depends on the server-side timeout. If the server-side timeout exceeds the default API_TIMEOUT, the client-side timeout is set to the server-side timeout plus an offset. This ensures the server has adequate time to respond before the client times out. I believe this is a more robust solution than simply documenting the behavior.
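A rough sketch of the dynamic logic described above; the names (`API_TIMEOUT`, `TIMEOUT_OFFSET`, `effective_request_timeout`) and the offset value are illustrative, not the identifiers used in the PR:

```python
# Sketch only: pick a client-side timeout that leaves the server room to respond.
API_TIMEOUT = 60        # default client-side timeout in seconds (assumed)
TIMEOUT_OFFSET = 30     # extra headroom on top of the server-side timeout (assumed)

def effective_request_timeout(server_side_timeout: int | None) -> int:
    """Return the client-side timeout to use for an API call.

    If the caller requested a server-side timeout longer than the default,
    extend the client-side timeout past it so the server times out first.
    """
    if server_side_timeout is not None and server_side_timeout >= API_TIMEOUT:
        return server_side_timeout + TIMEOUT_OFFSET
    return API_TIMEOUT
```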
Great. That is even better, as it makes the timeout behave more in line with user expectations. I do not want to sound pedantic, but I think this dynamic approach to handling conflicting timeouts should be mentioned in the docstring.
Though the 5-second default is quite short, users scouring the logs might notice the discrepancy and think it is unintentional.
jscheffl
left a comment
I like it as it is already. But adding a small note does not hurt - @AutomationDev85, can you apply this nit? Then I can merge it into the next providers wave by tomorrow.
@SameerMesiah97 I've adopted your text - thanks for the great improvement!
* Kubernetes API calls use timeouts
* Add wrapper API clients to add default timeout
* Fix mypy and docs issues
* Fix typing and rework server-side and client timeout
* Extend function note

Co-authored-by: AutomationDev85 <AutomationDev85>
Overview
We observed intermittent hangs in Kubernetes API communication from KubernetesPodOperator, with calls stalling for over a day until tasks were manually stopped. Investigation showed the Kubernetes Python client wasn’t using a client-side timeout, so stalled connections could block indefinitely. To improve robustness, we add a client-side timeout to API calls so they raise a clear exception instead of leaving tasks hanging. This does not fix the underlying cluster/API issue, but it makes failures detectable and recoverable.
We chose a 60-second timeout: long enough to tolerate load, short enough to prevent indefinite hangs.
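As a minimal sketch of the wrapper-client idea (this is illustrative, assuming a hypothetical `TimeoutApiWrapper` class; the actual wrapper in this PR may be structured differently), the wrapper injects a default client-side timeout into every API call unless the caller supplies one:

```python
# Illustrative sketch only, not the PR's implementation.
from typing import Any

from kubernetes import client

API_TIMEOUT = 60  # default client-side timeout in seconds


class TimeoutApiWrapper:
    """Wrap a Kubernetes API client so calls get `_request_timeout` by default."""

    def __init__(self, api: Any, timeout: float = API_TIMEOUT) -> None:
        self._api = api
        self._timeout = timeout

    def __getattr__(self, name: str) -> Any:
        attr = getattr(self._api, name)
        if not callable(attr):
            return attr

        def _with_timeout(*args: Any, **kwargs: Any) -> Any:
            # Only set the timeout if the caller did not provide one explicitly.
            kwargs.setdefault("_request_timeout", self._timeout)
            return attr(*args, **kwargs)

        return _with_timeout


# Usage example:
# core_v1 = TimeoutApiWrapper(client.CoreV1Api())
# core_v1.read_namespaced_pod(name="my-pod", namespace="default")
```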
Change Summary